538 research outputs found

Linguistically informed and corpus informed morphological analysis of Arabic

Author: Atwell ES
Sawalha M
Publication venue: Lancaster University Centre for Computer Corpus Research on Language
Publication date: 01/01/2009

Standard English PoS-taggers generally involve tag-assignment (via dictionary lookup etc.) followed by tag-disambiguation (via a context model, e.g. PoS-ngrams or Brill transformations). We want to PoS-tag our Arabic Corpus, but evaluation of existing PoS-taggers has highlighted shortcomings; in particular, about a quarter of all word tokens are not assigned a fully correct morphological analysis. Tag-assignment is significantly more complex for Arabic. An Arabic lemmatiser program can extract the stem or root, but this is not enough for full PoS-tagging; words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and postclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes and postclitics). Implementing Arabic morphology faces many challenges: the rich “root-and-pattern” nonconcatenative (or nonlinear) morphology and the highly complex word-formation process of roots and patterns, especially if one or two long vowels are part of the root letters. Further challenges come from the orthographic issues of Arabic, such as short vowels ( َ ُ ِ ), Hamzah (ء أ إ ؤ ئ), Taa’ Marboutah ( ة ) and Ha’ ( ه ), Ya’ ( ي ) and Alif Maksorah ( ى ), Shaddah ( ّ ) or gemination, and Maddah ( آ ) or extension, which is a compound letter of Hamzah and Alif ( أا ). Our morphological analyzer uses linguistic knowledge of the language as well as corpora to verify the linguistic information. To understand the problem, we started by analyzing fifteen established Arabic language dictionaries, to build a broad-coverage lexicon which contains not only roots and single words but also multi-word expressions, idioms, collocations requiring special part-of-speech assignment, and words with special part-of-speech tags.
The next stage of research was a detailed analysis and classification of Arabic language roots to address the “tail” of hard cases for existing morphological analyzers, and analysis of the roots, word-root combinations and the coverage of each root category of the Qur’an and the word-root information stored in our lexicon. From authoritative Arabic grammar books, we extracted and generated comprehensive lists of affixes, clitics and patterns. These lists were then cross-checked by analyzing words of three corpora: the Qur’an, the Corpus of Contemporary Arabic and the Penn Arabic Treebank (as well as our Lexicon, considered as a fourth cross-check corpus). We also developed a novel algorithm that generates the correct pattern of the words, which deals with the orthographic issues of the Arabic language and other word derivation issues, such as the elimination or substitution of root letters.
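
As a reading aid for the five-part decomposition described in this abstract, here is a minimal Python sketch of a container that holds one subtag per word part. It is not the authors' analyser: the class name `WordAnalysis`, the subtag labels and the transliterated example word are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A segment pairs a surface form with a subtag.
# The subtag labels used below are hypothetical, not the authors' tag set.
Segment = Tuple[str, str]


@dataclass
class WordAnalysis:
    """One word split into the five parts named in the abstract."""
    proclitics: List[Segment] = field(default_factory=list)
    prefixes: List[Segment] = field(default_factory=list)
    stem: Segment = ("", "")
    suffixes: List[Segment] = field(default_factory=list)
    postclitics: List[Segment] = field(default_factory=list)

    def segments(self) -> List[Segment]:
        """All parts in left-to-right order of the transliteration."""
        return [*self.proclitics, *self.prefixes, self.stem,
                *self.suffixes, *self.postclitics]

    def surface(self) -> str:
        """Concatenate the parts back into the full word."""
        return "".join(form for form, _ in self.segments())

    def subtags(self) -> List[str]:
        """One subtag per part, instead of a single tag per word."""
        return [tag for _, tag in self.segments()]


# Illustrative transliterated word: wa+sa+ya+ktub+uuna+haa
# ("and they will write it"); the segmentation is for demonstration only.
example = WordAnalysis(
    proclitics=[("wa", "CONJ"), ("sa", "FUT")],
    prefixes=[("ya", "IMPERF")],
    stem=("ktub", "VERB_STEM"),
    suffixes=[("uuna", "MASC_PL")],
    postclitics=[("haa", "OBJ_PRON")],
)
```

Storing each part with its own subtag keeps multiple proclitics or suffixes representable, which is the point the abstract makes about needing more than one tag per word.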

White Rose Research Online

Comparative evaluation of Arabic language morphological analysers and stemmers

Author: Atwell E.S.
Sawalha M.
Publication venue: Coling 2008 Organizing Committee
Publication date: 01/01/2008

Arabic morphological analysers and stemming algorithms have become a popular area of research. Many computational linguists have designed and developed algorithms to solve the problem of morphology and stemming. Each researcher proposed his own gold standard, testing methodology and accuracy measurements to test and compute the accuracy of his algorithm. As a result, these algorithms cannot be compared directly. In this paper we have accomplished two tasks. First, we proposed four different fair and precise accuracy measurements and two 1000-word gold standards taken from the Holy Qur’an and from the Corpus of Contemporary Arabic. Second, we combined the results from the morphological analysers and stemming algorithms by voting after running them on the sample documents. The evaluation of the algorithms shows that Arabic morphology is still a challenge.
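
The voting step mentioned in this abstract can be sketched in Python as a simple majority vote over the outputs of several analysers, scored against a gold standard. The abstract does not define its four accuracy measures or its tie-breaking rule, so the plain token-level accuracy, the first-listed tie-break and the function names `vote` and `accuracy` are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def vote(candidates: Sequence[str]) -> str:
    """Return the most frequent candidate; ties go to the earliest one listed."""
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(candidates)
    best = max(counts.values())
    for candidate in candidates:
        if counts[candidate] == best:
            return candidate
    raise AssertionError("unreachable")


def combine(outputs: Sequence[Sequence[str]]) -> List[str]:
    """Vote token by token over the outputs of several analysers."""
    return [vote(token_outputs) for token_outputs in zip(*outputs)]


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of tokens whose predicted root equals the gold-standard root."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical transliterated roots from three analysers for three tokens.
analyser_a = ["ktb", "qwl", "drs"]
analyser_b = ["ktb", "qAl", "drs"]
analyser_c = ["kAtb", "qwl", "mdrs"]
gold_roots = ["ktb", "qwl", "drs"]

voted = combine([analyser_a, analyser_b, analyser_c])
```

With these made-up outputs each analyser is wrong on at least one token, while the voted result agrees with the gold roots on all three.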

CiteSeerX

White Rose Research Online

Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic

Author: Atwell E.S.
Sawalha M.
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons have been constructed; these lexicons differ in ordering, size and aim or goal of construction. We collected 23 machine-readable lexicons, which are freely available on the web. We combined lexical resources into one large broad-coverage lexical resource by extracting information from disparate formats and merging traditional Arabic lexicons. To evaluate the broad-coverage lexical resource we computed coverage over the Qur’an, the Corpus of Contemporary Arabic, and a sample from the Arabic Web Corpus, using two methods. Counting exact word matches between test corpora and lexicon scored about 65-68%; Arabic has a rich morphology with many combinations of roots, affixes and clitics, so about a third of words in the corpora did not have an exact match in the lexicon. The second approach is to compute coverage in terms of use in a lemmatizer program, which strips clitics to look for a match for the underlying lexeme; this scored about 82-85%.
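
The two coverage measures contrasted in this abstract can be illustrated with a short Python sketch. The clitic lists, the one-level stripping strategy and the transliterated toy data are assumptions for illustration; the lemmatizer used in the paper is not described here and is certainly more elaborate.

```python
from typing import Iterable, Set

# Hypothetical transliterated clitics; real Arabic clitic lists are much longer.
PROCLITICS = ("wa", "bi", "al")
POSTCLITICS = ("haa", "hu")


def exact_coverage(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Share of tokens that appear in the lexicon exactly as written."""
    tokens = list(tokens)
    return sum(token in lexicon for token in tokens) / len(tokens)


def stripped_forms(token: str) -> Set[str]:
    """The token plus variants with one proclitic and/or one postclitic removed."""
    forms = {token}
    for clitic in PROCLITICS:
        if token.startswith(clitic) and len(token) > len(clitic):
            forms.add(token[len(clitic):])
    for form in list(forms):
        for clitic in POSTCLITICS:
            if form.endswith(clitic) and len(form) > len(clitic):
                forms.add(form[:-len(clitic)])
    return forms


def lemmatized_coverage(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Share of tokens with at least one clitic-stripped form in the lexicon."""
    tokens = list(tokens)
    hits = sum(bool(stripped_forms(token) & lexicon) for token in tokens)
    return hits / len(tokens)


toy_lexicon = {"kitaab", "qalam"}
toy_tokens = ["kitaab", "wakitaab", "qalamhu", "bayt"]
```

On this toy data the exact-match measure finds one token in four and the clitic-stripping measure finds three in four, mirroring the direction (not the size) of the gap reported in the abstract.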

White Rose Research Online

Comparative evaluation of Arabic language morphological analysers and stemmers

Author: Sawalha M.
Atwell E.S.
Publication venue: Coling 2008 Organizing Committee
Publication date: 01/01/2008

MIT Libraries Dome

White Rose Research Online

Fine-grain morphological analyzer and part-of-speech tagger for Arabic text

Author: Atwell ES
Sawalha M
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. Our aim is to develop a part-of-speech tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance. In this paper we review existing Arabic Part-of-Speech Taggers and tag-sets, and illustrate four different Arabic PoS tag-sets for a sample of Arabic text from the Qur’an. We describe the detailed fine-grained morphological feature tag set of Arabic, and the fine-grained Arabic morphological analyzer algorithm. We faced practical challenges in applying the morphological analyzer to the 100-million-word Web Arabic Corpus: we had to port the software to the National Grid Service, adapt the analyser to cope with spelling variations and errors, and utilise a Broad-Coverage Lexical Resource combining 23 traditional Arabic lexicons. Finally we outline the construction of a Gold Standard for comparative evaluation.

CiteSeerX

White Rose Research Online

Antidote Stocking at Hospitals in North Palestine

Author: Al-Jabi W Samah
Sawalha F Ansam
Sweileh M Waleed
Zyoud H Sa'ed
Publication venue: Dr. B.S. Kakkilaya
Publication date: 01/03/2007

Objective: The purpose of this study was to determine the availability and adequacy of antidote stocking at hospitals in north Palestine based on published guidelines for antidote stocking. Methodology: This study is a cross-sectional survey of all hospitals in north Palestine (n=11) using a questionnaire which was completed by the director of the pharmacy department at each hospital. The questionnaire was divided into 2 parts. The first part contained a list of 25 antidotes while the second part contained a list of 12 antidotes. This classification is based on the guideline proposed by the British Association for Emergency Medicine (BAEM). The net antidote stock results were compared with the American guidelines as well. Results: The overall availability of each antidote in the first list varied widely, from zero for glucagon to 100% for atropine. The number of antidotes from the first list that were stocked in the 11 hospitals ranged from 5 to 12, but none of the hospitals stocked all 25 antidotes. Additionally, availability of antidotes in the second list varied widely, from zero for polyethylene glycol to 100% for dobutamine. The number of antidotes from the second list that were stocked ranged from 5 to 9, but none of the hospitals stocked all 12 antidotes. Discussion and Conclusion: Hospitals in north Palestine do not have adequate stocks of antidotes. Raising awareness of the importance of antidotes by education, regular review of antidote storage, distribution plans, and appropriate legislation might provide solutions. Coordination between Palestinian hospitals and the PCDIC at An-Najah National University is also important.

Directory of Open Access Journals

CogPrints Cognitive Sciences Eprint Archive

Automatically generated, phonemic Arabic-IPA pronunciation tiers for the boundary annotated Qur'an dataset for machine learning (version 2.0)

Author: Atwell E
Brierley C
Sawalha M
Publication date: 01/01/2014

In this paper, we augment the Boundary Annotated Qur'an dataset published at LREC 2012 (Brierley et al 2012; Sawalha et al 2012a) with automatically generated phonemic transcriptions of Arabic words. We have developed and evaluated a comprehensive grapheme-phoneme mapping from Standard Arabic > IPA (Brierley et al under review), and implemented the mapping in Arabic transcription technology which achieves 100% accuracy as measured against two gold standards: one for Qur'anic or Classical Arabic, and one for Modern Standard Arabic (Sawalha et al [1]). Our mapping algorithm has also been used to generate a pronunciation guide for a subset of Qur'anic words with heightened prosody (Brierley et al 2014). This is funded research under the EPSRC "Working Together" theme.

White Rose Research Online

Leeds Beckett Repository

Tools for Arabic Natural Language Processing: a case study in qalqalah prosody

Author: Atwell E
Brierley C
Sawalha M
Publication venue: European Language Resources Association
Publication date: 01/01/2014

In this paper, we focus on the prosodic effect of qalqalah or "vibration" applied to a subset of Arabic consonants under certain constraints during correct Qur'anic recitation or taǧwīd, using our Boundary-Annotated Qur’an dataset of 77430 words (Brierley et al 2012; Sawalha et al 2014). These qalqalah events are rule-governed and are signified orthographically in the Arabic script. Hence they can be given abstract definition in the form of regular expressions and thus located and collected automatically. High frequency qalqalah content words are also found to be statistically significant discriminators or keywords when comparing Meccan and Medinan chapters in the Qur'an using a state-of-the-art Visual Analytics toolkit: Semantic Pathways. Thus we hypothesise that qalqalah prosody is one way of highlighting salient items in the text. Finally, we implement Arabic transcription technology (Brierley et al under review; Sawalha et al forthcoming) to create a qalqalah pronunciation guide where each word is transcribed phonetically in IPA and mapped to its chapter-verse ID. This is funded research under the EPSRC "Working Together" theme
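
To make the regular-expression idea in this abstract concrete, here is a minimal Python sketch that locates one kind of qalqalah event: a qalqalah consonant (qaf, taa', baa', jim, dal) carrying a sukun. The abstract does not give the authors' actual expressions, so the pattern is an assumption; it ignores qalqalah at a pause, and it assumes the standard sukun mark U+0652, whereas Uthmani-script editions of the Qur'an may encode sukun with a different code point.

```python
import re
from typing import List

# The five qalqalah consonants: qaf, taa', baa', jim, dal.
QALQALAH_LETTERS = "\u0642\u0637\u0628\u062C\u062F"
# Assumption: sukun is encoded as U+0652 (ARABIC SUKUN).
SUKUN = "\u0652"

# One qalqalah consonant immediately followed by sukun.
QALQALAH = re.compile(f"[{QALQALAH_LETTERS}]{SUKUN}")


def find_qalqalah(word: str) -> List[int]:
    """Return the character offsets of qalqalah consonants bearing sukun."""
    return [match.start() for match in QALQALAH.finditer(word)]


# A fully vowelized word with jim + sukun: yaa, fatha, jim, sukun, ain, fatha, lam, damma.
with_qalqalah = "\u064A\u064E\u062C\u0652\u0639\u064E\u0644\u064F"
# kaf, fatha, taa, fatha, baa, fatha: the baa carries a vowel, so no match is expected.
without_qalqalah = "\u0643\u064E\u062A\u064E\u0628\u064E"
```

Running `find_qalqalah` over every word of a vowelized text, together with each word's chapter-verse ID, would give the kind of automatically collected list the abstract describes, within the limits of the assumptions above.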

White Rose Research Online

Scrapie-resistant sheep show certain coat colour characteristics

Author: Bell L.
Brotherstone S.
Sawalha R. M.
Villanueva B.
White I.
Wilson A. J.
Publication venue: Cambridge University Press (CUP)
Publication date: 01/01/2009

Edinburgh Research Explorer

SRUC - Scotland's Rural College

A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging

Author: Al-Ghalayyni
Al-Sulaiti Latifa
Atwell Eric
Brill Eric
Cachia Pierre
Dahdah Antonie
Dukes Kais
Elliott John
Eric Atwell
Habash Nizar
Hamada Salwa
Harmain Harmain M.
Johansson Stig
Khoja Shereen
Majdi Sawalha
Sawalha Majdi
Talmon Rafi
Teahan Bill
Voutilainen Atro
Wright W.
Publication venue: Edinburgh University Press
Publication date: 01/04/2013

The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of grammar and Arabic, in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared; and we review generic design criteria for corpus tag sets. For a morphologically-rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash ‘-’ represents a feature not relevant to a given word. The first character shows the main Parts of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. The characters 2, 3, and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. 
The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.
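
The positional layout described in this abstract (22 characters, one feature per position, a dash for a feature that does not apply) can be decoded with a few lines of Python. The position-to-feature order follows the abstract, but the Python feature names, the function `decode_tag` and the value letters in the sample tag are hypothetical; the real SALMA value inventory is not given here.

```python
from typing import Dict

# Feature names in position order (1 to 22), paraphrased from the abstract.
FEATURES = (
    "main_pos", "noun_subclass", "verb_subclass", "particle_subclass",
    "other_residual", "punctuation", "gender", "number", "person",
    "inflectional_morphology", "case_or_mood", "case_and_mood_marks",
    "definiteness", "voice", "emphasis", "transitivity", "rational",
    "declension_and_conjugation", "augmentation", "root_letter_count",
    "verb_root", "noun_final_letter_type",
)

NOT_RELEVANT = "-"


def decode_tag(tag: str) -> Dict[str, str]:
    """Map each relevant position of a 22-character tag to its feature name."""
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} characters, got {len(tag)}")
    return {
        name: value
        for name, value in zip(FEATURES, tag)
        if value != NOT_RELEVANT
    }


# Hypothetical tag; the letters are placeholders, not real SALMA values.
# Positions 1-9, then 10-13, then 14, then 15-22.
sample_tag = "v-p---mst" + "----" + "a" + "--------"
decoded = decode_tag(sample_tag)
```

Because every feature has a fixed position, tags from different words can be compared column by column, which is what makes mapping other tag sets onto such a scheme straightforward.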

Crossref

White Rose Research Online